Corrigibility as a Singular Target: A Vision for Inherently Reliable Foundation Models

arXiv.org Artificial Intelligence

Foundation models (FMs) face a critical safety challenge: as capabilities scale, instrumental convergence drives default trajectories toward loss of human control, potentially culminating in existential catastrophe. Current alignment approaches struggle with value specification complexity and fail to address emergent power-seeking behaviors. We propose "Corrigibility as a Singular Target" (CAST): designing FMs whose overriding objective is empowering designated human principals to guide, correct, and control them. This paradigm shift from static value-loading to dynamic human empowerment transforms instrumental drives: self-preservation serves only to maintain the principal's control; goal modification becomes facilitating principal guidance. We present a comprehensive empirical research agenda spanning training methodologies (RLAIF, SFT, synthetic data generation), scalability testing across model sizes, and demonstrations of controlled instructability. Our vision: FMs that become increasingly responsive to human guidance as capabilities grow, offering a path to beneficial AI that remains as tool-like as possible, rather than supplanting human judgment. This addresses the core alignment problem at its source, preventing the default trajectory toward misaligned instrumental convergence.


Reviews: Multi-Agent Common Knowledge Reinforcement Learning

Neural Information Processing Systems

My two biggest complaints center on 1) the illustrative single-step matrix game of section 4.1 and figure 3 and 2) the practical applications of MACKRL. 1) Since the primary role of the single-step matrix game in section 4.1 is illustrative, it should be much clearer what is going on. How are all 3 policies parameterized? What information does each have access to? What is the training data? First, let's focus on the JAL policy. As presented up until this point in the paper, JAL means centralized training *and* execution.


Model-Free v. Model-Based Reinforcement Learning

#artificialintelligence

So you want to learn about Reinforcement Learning? Be prepared to enter this field with some confusion: its words and terminology make explanations confusing at best. Well, let's understand what the broad categories of Reinforcement Learning actually are, and the distinctions between them. From there, we can understand the important characteristics of the methods belonging to each category, and broaden our overall understanding of the field!


Eger

AAAI Conferences

Actions that affect knowledge asymmetrically between agents occur in numerous domains, from card games such as poker to the secure transmission of information. Applications in such domains often depend on reflection over knowledge, including what an agent knows about what other agents know. We are interested in enabling formal specification of these systems which may be used for executable prototyping as well as verification and other formal reasoning. Dynamic Epistemic Logic provides a formal basis for such reasoning, but is often prohibitively cumbersome to use in practice. We present an implementation and macro system called Ostari, backed by a particular flavor of Dynamic Epistemic Logic, which allows us to scale the ideas to more realistic problems. We demonstrate how actions that manipulate agents' beliefs can be written concisely and how this capability can be applied to modeling a popular card game by utilizing our system's ability to execute action sequences, answer queries about knowledge states, and find action sequences to satisfy a particular goal.


Everyone Knows that Everyone Knows: Gossip Protocols for Super Experts

arXiv.org Artificial Intelligence

A gossip protocol is a procedure for sharing secrets in a network. The basic action in a gossip protocol is a telephone call wherein the calling agents exchange all the secrets they know. An agent who knows all secrets is an expert. The usual termination condition is that all agents are experts. Instead, we explore protocols wherein the termination condition is that all agents know that all agents are experts. We call such agents super experts. Additionally, we model that agents who are super experts do not make and do not answer calls. Such agents are called engaged agents. We also model that such gossip protocols are common knowledge among the agents. We investigate conditions under which protocols terminate, both in the synchronous case, where there is a global clock, and in the asynchronous case, where there is not. We show that a commonly known protocol with engaged agents may terminate faster than the same protocol without engaged agents.
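The basic mechanics described above (calls that exchange all secrets, and experts who know every secret) can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's formalism: the names `call`, `is_expert`, and `run_calls` are hypothetical, agents are assumed to be numbered from 0, and the code checks only the usual termination condition (all agents are experts). It does not model the super-expert condition, engaged agents, or common knowledge of the protocol, which require epistemic reasoning beyond this sketch.

```python
def call(secrets, a, b):
    """A telephone call: agents a and b exchange all the secrets they know."""
    merged = secrets[a] | secrets[b]
    secrets[a] = set(merged)
    secrets[b] = set(merged)


def is_expert(secrets, agent):
    """An expert is an agent who knows all secrets (one secret per agent)."""
    return len(secrets[agent]) == len(secrets)


def run_calls(n, sequence):
    """Start with each of n agents knowing only its own secret,
    then execute the given sequence of (caller, callee) pairs."""
    secrets = [{i} for i in range(n)]
    for a, b in sequence:
        call(secrets, a, b)
    return secrets


# Hypothetical call sequence for four agents: 0-1, 2-3, 0-2, 1-3.
secrets = run_calls(4, [(0, 1), (2, 3), (0, 2), (1, 3)])
all_experts = all(is_expert(secrets, i) for i in range(4))
print(all_experts)
```

With this particular sequence every agent holds all four secrets after the fourth call, so the usual termination condition is met. Whether each agent also knows that the others are experts is a separate question that the sketch does not attempt to answer.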


Letters to the Editor

AI Magazine

Definition 2. An agent's knowledge is the set of all statements that the agent knows (i.e., the set {s : the agent knows s}). An agent's problem-solving behavior is


A Semantical Account of Progression in the Presence of Defaults

AAAI Conferences

In previous work, we proposed a modal fragment of the situation calculus called ES, which fully captures Reiter's basic action theories. ES also has epistemic features, including only-knowing, which refers to all that an agent knows in the sense of having a knowledge base. While our model of only-knowing has appealing properties in the static case, it appears to be problematic when actions come into play. First of all, its utility seems to be restricted to an agent's initial knowledge base. Second, while it has been shown that only-knowing correctly captures default inferences, this was only in the static case, and undesirable properties appear to arise in the presence of actions. In this paper, we remedy both of these shortcomings and propose a new dynamic semantics of only-knowing, which is closely related to Lin and Reiter's notion of progression when actions are performed and where defaults behave properly.